74 research outputs found
Scalable Boolean Tensor Factorizations using Random Walks
Tensors are becoming increasingly common in data mining, and consequently,
tensor factorizations are becoming more and more important tools for data
miners. When the data is binary, it is natural to ask if we can factorize it
into binary factors while simultaneously making sure that the reconstructed
tensor is still binary. Such factorizations, called Boolean tensor
factorizations, can provide improved interpretability and find Boolean
structure that is hard to express using normal factorizations. Unfortunately,
the algorithms for computing Boolean tensor factorizations do not usually scale
well. In this paper we present a novel algorithm for finding Boolean CP and
Tucker decompositions of large and sparse binary tensors. In our experimental
evaluation we show that our algorithm can handle large tensors and accurately
reconstructs the latent Boolean structure.
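The Boolean CP decomposition mentioned above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's algorithm: the function name and the toy factors are ours. The key point is that, under Boolean algebra, overlapping rank-1 components are combined with logical OR rather than addition, so entries never exceed 1.

```python
import numpy as np

def boolean_cp_reconstruct(A, B, C):
    """Reconstruct a binary tensor from Boolean CP factor matrices.

    A, B, C are binary matrices of shapes (n, r), (m, r), (l, r).
    Under Boolean algebra 1 + 1 = 1, so the r rank-1 components are
    combined with logical OR instead of ordinary summation.
    """
    n, r = A.shape
    X = np.zeros((n, B.shape[0], C.shape[0]), dtype=bool)
    for k in range(r):
        # Outer product of the k-th columns gives one rank-1 binary tensor.
        X |= np.einsum('i,j,k->ijk', A[:, k], B[:, k], C[:, k]).astype(bool)
    return X.astype(np.int8)

# Two rank-1 components on a small 3x3x3 tensor (illustrative values).
A = np.array([[1, 0], [1, 1], [0, 1]])
B = np.array([[1, 0], [1, 0], [0, 1]])
C = np.array([[1, 1], [0, 1], [1, 0]])
X = boolean_cp_reconstruct(A, B, C)
print(X.max())  # entries stay 0/1 even where components overlap
```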
Reductions for Frequency-Based Data Mining Problems
Studying the computational complexity of problems is one of the most
fundamental questions - if not the most fundamental one - in computer
science. Yet, surprisingly little is known
about the computational complexity of many central problems in data mining. In
this paper we study frequency-based problems and propose a new type of
reduction that allows us to compare the complexities of the maximal frequent
pattern mining problems in different domains (e.g. graphs or sequences). Our
results extend those of Kimelfeld and Kolaitis [ACM TODS, 2014] to a broader
range of data mining problems. Our results show that, by allowing constraints
in the pattern space, the complexities of many maximal frequent pattern mining
problems collapse. These problems include maximal frequent subgraphs in
labelled graphs, maximal frequent itemsets, and maximal frequent subsequences
with no repetitions. In addition to theoretical interest, our results might
yield more efficient algorithms for the studied problems.
Comment: This is an extended version of a paper of the same title to appear in
the Proceedings of the 17th IEEE International Conference on Data Mining
(ICDM'17).
Algorithms for Approximate Subtropical Matrix Factorization
Matrix factorization methods are important tools in data mining and analysis.
They can be used for many tasks, ranging from dimensionality reduction to
visualization. In this paper we concentrate on the use of matrix factorizations
for finding patterns from the data. Rather than using the standard algebra --
and the summation of the rank-1 components to build the approximation of the
original matrix -- we use the subtropical algebra, which is an algebra over the
nonnegative real values with the summation replaced by the maximum operator.
Subtropical matrix factorizations allow "winner-takes-it-all" interpretations
of the rank-1 components, revealing different structure than the normal
(nonnegative) factorizations. We study the complexity and sparsity of the
factorizations, and present a framework for finding low-rank subtropical
factorizations. We present two specific algorithms, called Capricorn and
Cancer, that are part of our framework. They can be used with data that has
been corrupted with different types of noise, and with different error metrics,
including the sum-of-absolute differences, Frobenius norm, and Jensen--Shannon
divergence. Our experiments show that the algorithms perform well on data that
has subtropical structure, and that they can find factorizations that are both
sparse and easy to interpret.
Comment: 40 pages, 9 figures. For the associated source code, see
http://people.mpi-inf.mpg.de/~pmiettin/tropical
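The max-times (subtropical) product described above is easy to state in numpy. This is our own illustrative sketch, not code from the paper: the only change from the standard matrix product is that the summation over the rank dimension is replaced by a maximum, which is what gives each output entry a single "winning" rank-1 component.

```python
import numpy as np

def subtropical_reconstruct(B, C):
    """Max-times (subtropical) matrix product.

    B is (n, r), C is (r, m), both nonnegative. Entry (i, j) is
    max_k B[i, k] * C[k, j] instead of sum_k B[i, k] * C[k, j].
    """
    # Broadcast to (n, r, m), multiply elementwise, take max over r.
    return np.max(B[:, :, None] * C[None, :, :], axis=1)

B = np.array([[2.0, 0.0],
              [1.0, 3.0]])
C = np.array([[1.0, 0.5],
              [0.0, 2.0]])
R = subtropical_reconstruct(B, C)
print(R)  # each entry comes from one dominating component
```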
On Defining SPARQL with Boolean Tensor Algebra
The Resource Description Framework (RDF) represents information as
subject-predicate-object triples. These triples are commonly interpreted as a
directed labelled graph. We propose an alternative approach, interpreting the
data as a 3-way Boolean tensor. We show how SPARQL queries - the standard
query language for RDF - can be expressed as elementary operations in Boolean algebra,
giving us a complete re-interpretation of RDF and SPARQL. We show how the
Boolean tensor interpretation allows for new optimizations and analyses of the
complexity of SPARQL queries. For example, estimating the size of the results
for different join queries becomes much simpler.
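As a toy illustration of the tensor view (the entity names, predicate names, and tensor orientation here are our assumptions, not the paper's conventions): store each triple as a 1 in a subject-by-object-by-predicate Boolean tensor, and a two-step join pattern then reduces to a Boolean matrix product of two predicate slices.

```python
import numpy as np

# Toy RDF data over entities {0: alice, 1: bob, 2: carol} and
# predicates {0: "knows", 1: "likes"}; all names are illustrative.
n_entities, n_predicates = 3, 2
T = np.zeros((n_entities, n_entities, n_predicates), dtype=bool)
T[0, 1, 0] = True  # alice knows bob
T[1, 2, 1] = True  # bob likes carol

# The join pattern { ?x knows ?y . ?y likes ?z } corresponds to the
# Boolean product of the "knows" and "likes" slices of the tensor.
knows, likes = T[:, :, 0], T[:, :, 1]
result = (knows.astype(int) @ likes.astype(int)) > 0
print(result[0, 2])  # True: alice knows someone who likes carol
```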
Clustering Boolean Tensors
Tensor factorizations are computationally hard problems, and in particular,
are often significantly harder than their matrix counterparts. In the case of
Boolean tensor factorizations -- where the input tensor and all the factors are
required to be binary and we use Boolean algebra -- much of that hardness comes
from the possibility of overlapping components. Yet, in many applications we
are perfectly happy to partition at least one of the modes. In this paper we
investigate what consequences this partitioning has for the computational
complexity of Boolean tensor factorizations and present a new algorithm for
the resulting clustering problem. This algorithm can alternatively be seen as a
particular kind of regularized clustering algorithm that can handle extremely
high-dimensional observations. We analyse our algorithms with the goal of
maximizing the similarity and argue that this is more meaningful than
minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient
0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm
for Boolean tensor clustering achieves high scalability, high similarity, and
good generalization to unseen data on both synthetic and real-world data
sets.
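To give a feel for the rank-1 binary factorization subproblem mentioned above, here is one alternating step as a simple majority rule. This is our own heuristic sketch, not the paper's 0.828-approximation algorithm: given a fixed binary column pattern u, each column of the row pattern v is switched on exactly when doing so covers more ones than it introduces zeros.

```python
import numpy as np

def best_rank1_given_u(X, u):
    """Given binary matrix X and a fixed binary column pattern u, pick
    the binary row pattern v minimizing |X - u v^T| by a majority rule:
    include column j iff u covers more ones than zeros of X[:, j]."""
    covered_ones = X[u == 1].sum(axis=0)         # ones covered per column
    flipped_zeros = (1 - X[u == 1]).sum(axis=0)  # zeros turned into ones
    return (covered_ones > flipped_zeros).astype(int)

# A planted rank-1 block (rows 0-1, columns 0-1) plus an unrelated entry.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
u = np.array([1, 1, 0])
v = best_rank1_given_u(X, u)
print(v)  # [1 1 0]: the block's columns are selected, the third is not
```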
Latitude: A Model for Mixed Linear-Tropical Matrix Factorization
Nonnegative matrix factorization (NMF) is one of the most frequently-used
matrix factorization models in data analysis. A significant reason for the
popularity of NMF is its interpretability and the `parts of whole'
interpretation of its components. Recently, max-times, or subtropical, matrix
factorization (SMF) has been introduced as an alternative model with equally
interpretable `winner takes it all' interpretation. In this paper we propose a
new mixed linear--tropical model, and a new algorithm, called Latitude, that
combines NMF and SMF and can smoothly interpolate between the two. In our
model, the data is modeled using the latent factors and latent parameters that
control whether the factors are interpreted as NMF or SMF features, or their
mixtures. We present an algorithm for our novel matrix factorization. Our
experiments show that our algorithm improves over both baselines, and can yield
interpretable results that reveal more of the latent structure than either NMF
or SMF alone.
Comment: 14 pages, 6 figures. To appear in the 2018 SIAM International Conference
on Data Mining (SDM '18). For the source code, see
https://people.mpi-inf.mpg.de/~pmiettin/linear-tropical
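One way to see how a reconstruction can sit between summation and maximum is the power-mean trick sketched below. To be clear, this is only our illustration of the general idea of interpolating between the linear and tropical extremes; Latitude's actual model uses latent mixing parameters, not this formula.

```python
import numpy as np

def blended_product(B, C, p):
    """Interpolate between linear and max-times matrix products.

    For p = 1 this is the ordinary matrix product; as p grows,
    (sum_k (B_ik C_kj)^p)^(1/p) tends to max_k B_ik C_kj.
    Illustrative only; not the Latitude model itself.
    """
    terms = B[:, :, None] * C[None, :, :]       # shape (n, r, m)
    return (terms ** p).sum(axis=1) ** (1.0 / p)

B = np.array([[1.0, 2.0]])
C = np.array([[3.0], [1.0]])
print(blended_product(B, C, 1))   # [[5.]]  linear: 1*3 + 2*1
print(blended_product(B, C, 50))  # close to 3, the max of {3, 2}
```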
Boolean matrix factorization meets consecutive ones property
Boolean matrix factorization is a natural and popular technique for summarizing binary matrices. In this paper, we study a variant of Boolean matrix factorization where we additionally require that the factor matrices have the consecutive ones property (OBMF). A major application of this optimization problem comes from graph visualization: standard techniques for visualizing graphs are circular or linear layouts, where nodes are ordered in a circle or on a line. A common problem with visualizing graphs is clutter due to too many edges. The standard approach to dealing with this is to bundle edges together and represent them as a ribbon. We show that we can use OBMF for edge bundling combined with circular or linear layout techniques. We demonstrate that not only is this problem NP-hard, but there cannot be a polynomial-time algorithm that yields a multiplicative approximation guarantee (unless P = NP). On the positive side, we develop a greedy algorithm where at each step we look for the best rank-1 factorization. Since even obtaining a rank-1 factorization is NP-hard, we propose an iterative algorithm where we fix one side and find the other, reverse the roles, and repeat. We show that this step can be done in linear time using pq-trees. We also extend the problem to the cyclic ones property and to symmetric factorizations. Our experiments show that our algorithms find high-quality factorizations and scale well.
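The consecutive ones property itself is simple to test for a fixed column order, as the sketch below shows (the function name and examples are ours). The hard part that pq-trees address is finding a column ordering under which the property holds; this check only verifies a given ordering.

```python
import numpy as np

def has_consecutive_ones(M):
    """Return True iff every row of binary matrix M has its ones in one
    contiguous block under the given column order. (OBMF additionally
    searches over orderings, which is where pq-trees come in.)"""
    for row in np.asarray(M):
        idx = np.flatnonzero(row)
        if idx.size and (idx[-1] - idx[0] + 1) != idx.size:
            return False  # ones span a wider window than their count
    return True

print(has_consecutive_ones([[0, 1, 1, 0],
                            [1, 1, 0, 0]]))  # True: contiguous blocks
print(has_consecutive_ones([[1, 0, 1, 0]]))  # False: ones with a gap
```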
What you will gain by rounding: theory and algorithms for rounding rank
When factorizing binary matrices, we often have to make a choice between using expensive combinatorial methods
that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. Alternatively, we can first compute a continuous factorization and subsequently apply a rounding procedure to obtain a discrete representation. But what will we gain by rounding? Will this yield lower reconstruction errors? Is it easy
to find a low-rank matrix that rounds to a given binary matrix? Does it matter which threshold we use for rounding? Does it
matter if we allow for only non-negative factorizations? In this paper, we approach these and further questions by presenting
and studying the concept of rounding rank. We show that rounding rank is related to linear classification, dimensionality
reduction, and nested matrices. We also report on an extensive experimental study that compares different algorithms for finding good factorizations under the rounding rank model.
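The rounding step itself can be illustrated in a few lines (the matrices below are our own toy values): take a real-valued rank-2 matrix and threshold it at 1/2. The resulting binary matrix has rounding rank at most 2, since by definition the rounding rank of a binary matrix is the smallest k such that some rank-k real matrix rounds to it.

```python
import numpy as np

# A real-valued rank-2 matrix built from two factors (values are ours).
U = np.array([[1.0, 0.2],
              [0.1, 1.0],
              [0.9, 0.8]])
V = np.array([[0.6, 0.0, 0.4],
              [0.0, 0.6, 0.3]])
L = U @ V                    # real matrix of rank at most 2
B = (L >= 0.5).astype(int)   # round each entry at threshold 1/2
print(B)                     # a binary matrix of rounding rank <= 2
```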